Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types #1347
kosiew wants to merge 7 commits into apache:main
Store UDAF return type in Rust accumulator and wrap pyarrow Array/ChunkedArray returns into list scalars for list-like return types. Add a UDAF test to return a list of timestamps via a pyarrow array, validating the aggregate output for correctness.
Add documented list-valued scalar returns for UDAF accumulators, including an example with pa.scalar and a note about unsupported pyarrow.Array returns from evaluate(). Also, introduce a UDAF FAQ entry detailing list-returning patterns and required return_type/state_type definitions.
…ype checking for list types
…nbinding and binding fresh copies when checking array-likeness, eliminating the Bound reference error
Sorry it's taken me a while to get around to this PR. It feels like we are doing two different things
It feels like this isn't the best option. I think we want to avoid doing any kind of I think a more general solution would be something like
For the last part we could do something like Additionally, if we're going to go down this route I think we would want to treat both the An advantage of the point described above is that I think it adds more flexibility to the users because their python functions can just return python integers and such without having to convert them to pyarrow scalars. What do you think?
One problem I see with my answer above ^ is that some libraries like nanoarrow DO implement
Which issue does this PR close?
Rationale for this change
Python UDAFs that logically return a list (e.g., “collect all timestamps for a group”) were easy to implement incorrectly by returning a pyarrow.Array (or ChunkedArray) directly from Accumulator.evaluate(). In that case, DataFusion’s Python ↔ Rust conversion path could attempt to treat the array object as an integer-like scalar, leading to confusing conversion failures such as:

ArrowTypeError: object of type <class 'pyarrow.lib.TimestampArray'> cannot be converted to int

To make list-returning UDAFs work reliably (and make the contract explicit), we now ensure the UDAF result is a list-valued scalar that matches the declared Arrow return type.
What changes are included in this PR?
Documentation & API contract clarification
- … pa.scalar and declare list types for return_type and state_type).
- … Accumulator.evaluate() docstring to explicitly state that returning a pyarrow.Array is not supported unless converted to a list-valued scalar.

Runtime behavior improvements in the Rust accumulator bridge
- … RustAccumulator.evaluate, detect when the declared return type is a list (List, LargeList, FixedSizeList) and the Python evaluate() result is pyarrow.Array / pyarrow.ChunkedArray.
- … pyarrow.scalar(..., type=<declared list type>) before extracting the scalar value for DataFusion.
- … pyarrow.Array and pyarrow.ChunkedArray Python types inside the accumulator to avoid repeated pyarrow imports/type lookups on every evaluate call.

Tests
- … test_udaf_list_timestamp_return) covering a UDAF that returns a list of timestamp(ns) values.
- … CollectTimestamps accumulator used by the test that maintains list state as a list-valued scalar and returns a list-valued scalar from evaluate().

Are these changes tested?
Yes.
- … list<timestamp(ns)> and that the collected results match the expected list-of-timestamps output.

Are there any user-facing changes?
Yes.
- … pyarrow.Array/ChunkedArray from evaluate() for a list-typed UDAF, it is now converted into the correct list-valued scalar automatically (rather than failing with a type conversion error).
- … evaluate() must return a scalar.

LLM-generated code disclosure
This PR includes code and comments generated with assistance from an LLM. All LLM-generated content has been manually reviewed and tested.